refactor: Extract RegexDFAState class, RegexDFAStatePair class, and RegexDFAStateType enum into their own files. #57

Merged
merged 372 commits into from
Dec 11, 2024

Conversation

SharafMohamed
Contributor

@SharafMohamed SharafMohamed commented Dec 5, 2024

  • Running git diff locally gives better comparisons for renamed files than the GitHub UI:
git fetch upstream
git fetch upstream pull/57/head:pr-57
git diff upstream/main pr-57 

Description

Separate DFA functionality into different files.

  • RegexDFAState class moved into its own file.
  • RegexDFAStatePair class moved into its own file.
  • RegexDFAStateType enum moved into its own file.

Validation performed

Previously existing tests still succeed.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced new header files for managing DFA states and their types.
    • Added functionality for creating and managing DFA states, including methods for transitions and acceptance checks.
    • Implemented logic for computing intersections of DFAs.
  • Bug Fixes

    • Streamlined the DFA state management by consolidating redundant structures.
  • Documentation

    • Updated documentation to reflect new classes and methods for better clarity on DFA state management.

SharafMohamed and others added 30 commits October 24, 2024 11:48
…ransitions) return nullopt if state_ids is malformed.
…ake it clear to the reader that both failures are handled the same way and return nullopt. For more complicated return cases it would warrant the reader looking at the doc for the individual functions, but here I think we can make their life easier.
…nor are parts of the rules stored, instead the rules are only read and used to build the NFA.
…call succeeds in NFA's serialize.

Co-authored-by: Lin Zhihao <[email protected]>
…on classes when they are initialized in their constructor.
…egativeTaggedTransition classes into their own files.
…just an id. This object is created and owned by the capture AST, and other AST and NFA states point to these tags.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

35-35: Pass small integral type uint8_t by value instead of by const reference

The parameter byte in add_byte_transition is a uint8_t, which is more efficiently passed by value rather than by const reference. Passing small integral types by value avoids unnecessary indirection.

Apply this diff to pass byte by value:

-auto add_byte_transition(uint8_t const& byte, RegexDFAState<stateType>* dest_state) -> void {
+auto add_byte_transition(uint8_t byte, RegexDFAState<stateType>* dest_state) -> void {
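
As a rough illustration of the by-value suggestion, here is a minimal sketch. `ToyByteState` and its members are hypothetical stand-ins written for this example, not the actual `RegexDFAState` from the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a byte-indexed DFA state (not the real RegexDFAState).
class ToyByteState {
public:
    // `byte` is taken by value: a uint8_t fits in a register, so a const reference
    // would only add a level of indirection.
    auto add_byte_transition(std::uint8_t byte, ToyByteState* dest_state) -> void {
        m_bytes_transition[byte] = dest_state;
    }

    // Returns the destination state for `byte`, or nullptr if none was added.
    [[nodiscard]] auto next(std::uint8_t byte) const -> ToyByteState* {
        return m_bytes_transition[byte];
    }

private:
    static constexpr std::size_t cSizeOfByte{256};
    // Value-initialized, so every entry starts as nullptr.
    std::array<ToyByteState*, cSizeOfByte> m_bytes_transition{};
};
```

Because a `std::uint8_t` can only hold 0 to 255, indexing a 256-entry table with it cannot go out of range in this sketch.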
src/log_surgeon/finite_automata/RegexDFA.hpp (1)

65-65: Use idiomatic condition in the while loop

Consider using !unvisited_pairs.empty() instead of false == unvisited_pairs.empty() for better readability and to align with common coding practices.

Apply this diff:

-    while (false == unvisited_pairs.empty()) {
+    while (!unvisited_pairs.empty()) {
src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

70-80: Offer assistance to implement UTF-8 transitions

The method get_reachable_pairs currently handles only single-byte transitions, as indicated by the TODO comment. Supporting UTF-8 (multi-byte) transitions is essential for full Unicode compatibility.

I can help implement support for UTF-8 transitions in this method. Would you like me to provide a solution or open a GitHub issue to track this enhancement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 99b5b08 and 66ed13b.

📒 Files selected for processing (6)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.hpp (2 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.tpp (0 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1 hunks)
💤 Files with no reviewable changes (1)
  • src/log_surgeon/finite_automata/RegexDFA.tpp
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp
🧰 Additional context used
📓 Learnings (3)
CMakeLists.txt (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/RegexDFA.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
🔇 Additional comments (1)
CMakeLists.txt (1)

100-102: Addition of new source files is correct

The inclusion of the new source files RegexDFAState.hpp, RegexDFAStatePair.hpp, and RegexDFAStateType.hpp into the build system is appropriate. This ensures that the newly created classes are compiled and linked properly.

@SharafMohamed SharafMohamed changed the title refactor: Separate DFA functionality into different files. refactor: Extract RegexDFAState class, RegexDFAStatePair class, and RegexDFAStateType enum into their own files. Dec 6, 2024

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexDFA.hpp (2)

12-14: Consider addressing the TODO comment in this PR

The TODO suggests renaming RegexDFA to DFA. Since this PR is already refactoring the DFA-related code, it might be a good opportunity to implement this naming change as well.

Would you like me to help with renaming the class and updating all its references?


60-60: Track the UTF-8 handling requirement

The TODO comment indicates missing functionality for handling UTF-8 multi-byte transitions.

Would you like me to create a GitHub issue to track the UTF-8 handling requirement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 66ed13b and 176391b.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/RegexDFA.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp
🧰 Additional context used
📓 Path-based instructions (1)
src/log_surgeon/finite_automata/RegexDFA.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/RegexDFA.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (3)
src/log_surgeon/finite_automata/RegexDFA.hpp (3)

27-34: LGTM! Parameter type changed as suggested

The method signature has been updated to use a raw pointer instead of a unique_ptr, which is more appropriate for this use case where ownership transfer isn't needed.
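
A minimal sketch of the non-owning-pointer pattern described here, assuming the callee only reads the other DFA. `ToyDfa` and `count_combined_states` are hypothetical names for illustration and do not appear in the PR.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a DFA; it only stores state ids here.
struct ToyDfa {
    std::vector<int> states;
};

// Takes a non-owning raw pointer: the callee only reads `other`, so ownership
// stays with the caller's unique_ptr and nothing is moved or released.
[[nodiscard]] inline auto count_combined_states(ToyDfa const& self, ToyDfa const* other)
        -> std::size_t {
    if (nullptr == other) {
        return self.states.size();
    }
    return self.states.size() + other->states.size();
}
```

A caller that holds a `std::unique_ptr<ToyDfa>` would pass `dfa2.get()` and keep using `dfa2` afterwards, since no ownership was transferred.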


40-52: LGTM! Clean implementation of state creation

The implementation properly manages memory using smart pointers and correctly handles the transfer of matching variable IDs from NFA to DFA states.


54-72: LGTM! Well-structured intersection algorithm

The implementation:

  • Correctly manages state pair traversal
  • Properly collects matching variable IDs
  • Follows coding guidelines with false == unvisited_pairs.empty()
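
The state-pair traversal praised above can be sketched as a worklist over the product of two DFAs. This is a simplified illustration under stated assumptions, not the PR's implementation: `ToyDfa`, `get_intersect_ids`, the integer state indices, and the choice to collect IDs from the first DFA are all hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <stack>
#include <utility>
#include <vector>

// Hypothetical toy DFA over bytes: state 0 is the start state and -1 means
// "no transition". A state is accepting when it has at least one variable ID.
struct ToyDfa {
    struct State {
        State() { next.fill(-1); }

        std::array<int, 256> next;
        std::set<int> matching_variable_ids;
    };

    std::vector<State> states;
};

using StatePair = std::pair<int, int>;

// Walks every state pair reachable from (0, 0) and collects the variable IDs of
// `dfa1` at pairs where both component states are accepting.
[[nodiscard]] inline auto get_intersect_ids(ToyDfa const& dfa1, ToyDfa const& dfa2)
        -> std::set<int> {
    std::set<int> ids;
    if (dfa1.states.empty() || dfa2.states.empty()) {
        return ids;
    }
    std::set<StatePair> visited_pairs;
    std::stack<StatePair> unvisited_pairs;
    visited_pairs.emplace(0, 0);
    unvisited_pairs.emplace(0, 0);
    while (false == unvisited_pairs.empty()) {
        auto const [s1, s2] = unvisited_pairs.top();
        unvisited_pairs.pop();
        auto const& state1 = dfa1.states[static_cast<std::size_t>(s1)];
        auto const& state2 = dfa2.states[static_cast<std::size_t>(s2)];
        if (false == state1.matching_variable_ids.empty()
            && false == state2.matching_variable_ids.empty())
        {
            ids.insert(state1.matching_variable_ids.begin(), state1.matching_variable_ids.end());
        }
        // Single-byte transitions only; multi-byte UTF-8 is not handled here.
        for (std::size_t byte{0}; byte < 256; ++byte) {
            int const n1{state1.next[byte]};
            int const n2{state2.next[byte]};
            if (n1 < 0 || n2 < 0) {
                continue;
            }
            // emplace reports whether the pair is new, so each pair is queued once.
            if (visited_pairs.emplace(n1, n2).second) {
                unvisited_pairs.emplace(n1, n2);
            }
        }
    }
    return ids;
}
```

The visited set bounds the traversal at the number of distinct state pairs, so the loop terminates even when both DFAs contain cycles.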

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

55-55: Optimise memory usage by conditionally including m_bytes_transition

The member m_bytes_transition is always included in RegexDFAState, even when stateType is UTF8, which can lead to unnecessary memory usage. Consider conditionally including m_bytes_transition only when stateType is Byte, similar to how m_tree_transitions is conditionally included.
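
One way to include a member conditionally is `std::conditional_t` with an empty placeholder type. The sketch below assumes the byte table really is unused for UTF-8 states, which this thread does not confirm; `StateType`, `Empty`, and `ToyState` are hypothetical names, not the PR's types.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <type_traits>

enum class StateType : std::uint8_t {
    Byte,
    UTF8
};

// Hypothetical placeholder that stands in for a member that is not needed.
struct Empty {};

template <StateType state_type>
struct ToyState {
    using ByteTable = std::array<ToyState*, 256>;
    using Tree = std::map<std::uint32_t, ToyState*>;

    // Each member collapses to `Empty` when the state type does not need it.
    std::conditional_t<StateType::Byte == state_type, ByteTable, Empty> m_bytes_transition{};
    std::conditional_t<StateType::UTF8 == state_type, Tree, Empty> m_tree_transitions{};
};
```

An `Empty` member still occupies at least one byte unless it is marked `[[no_unique_address]]` (C++20), but that is small next to a 256-pointer table.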

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

71-71: Implement UTF-8 handling in get_reachable_pairs

There is a TODO comment indicating that UTF-8 (multi-byte transitions) are not yet handled in the get_reachable_pairs method. Implementing this functionality is important for full UTF-8 support. Would you like assistance in adding this implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 176391b and 0decaf5.

📒 Files selected for processing (4)
  • examples/intersect-test.cpp (1 hunks)
  • src/log_surgeon/Lexer.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (3)
examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/RegexDFAState.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (2)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

63-65: ⚠️ Potential issue

Add bounds check on character in Byte state

Accessing m_bytes_transition[character] without validating character may lead to out-of-bounds access if character is greater than or equal to cSizeOfByte. Consider adding an assertion to ensure character is within bounds to prevent potential errors.

Apply this diff to add the assertion:

 if constexpr (RegexDFAStateType::Byte == stateType) {
+    assert(character < cSizeOfByte);
     return m_bytes_transition[character];
 }
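
Note that `assert` is compiled out when `NDEBUG` is defined, so the check above only protects debug builds. If a release-build guard were wanted, one option is an explicit range check, sketched below. `ToyState`, `ByteTable`, and `next_or_null` are hypothetical names, and returning `nullptr` for out-of-range input is an assumption of this sketch, not behaviour taken from the PR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t cSizeOfByte{256};

// Hypothetical stand-in for a DFA state.
struct ToyState {};

using ByteTable = std::array<ToyState const*, cSizeOfByte>;

// Returns the destination for `character`, or nullptr when `character` falls
// outside the byte range or has no transition.
[[nodiscard]] inline auto next_or_null(ByteTable const& table, std::uint32_t character)
        -> ToyState const* {
    if (character >= cSizeOfByte) {
        return nullptr;
    }
    return table[character];
}
```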
examples/intersect-test.cpp (1)

45-45: Function call to get_intersect updated correctly

The get_intersect function call has been correctly updated to pass a raw pointer using dfa2.get(), aligning with the updated function signature. This ensures proper functionality.

Member

@LinZhihao-723 LinZhihao-723 left a comment

The refactoring looks good to me. Let's agree on the macro naming, and then we can merge.

@@ -0,0 +1,80 @@
#ifndef LOG_SURGEON_FINITE_AUTOMATA_REGEX_DFA_STATE
Member

Sorry for missing this in previous refactor PRs: I think we should name macros to exactly match the file name, so this header should be LOG_SURGEON_FINITE_AUTOMATA_REGEXDFASTATE instead. We can create an issue to keep track of this and fix them all together later.

Contributor Author

@SharafMohamed SharafMohamed Dec 11, 2024

kk sounds good, I'll create the issue. I was previously separating it on capitalization, e.g. log_surgeon/finite_automata/DfaState would use #ifndef LOG_SURGEON_FINITE_AUTOMATA_DFA_STATE as the correct snake_case naming for the separate words (as we're combining snake_case folder names and CamelCase file names).

Contributor Author

Issue created.

Member

@LinZhihao-723 LinZhihao-723 left a comment

The PR title looks good to me.
The macro naming issue is tracked here: #65

2 participants